Add scribe stream command for live microphone transcription#1

Draft
javiertoledo wants to merge 7 commits into main from feature/stream

Conversation

@javiertoledo
Member

Summary

  • Add scribe stream command for live microphone transcription
  • Two engines: default (Parakeet TDT v3, multilingual, ~11s latency) and Nemotron (English-only, ~560ms latency)
  • README updated with streaming docs and engine trade-offs

Status

Draft — streaming not working reliably yet. Known issues:

  • Nemotron engine: output shows mixed/repeated text from accumulated transcript diffing
  • Default engine: gets stuck when switching languages mid-stream
  • Default engine: ~11s latency (inherent to SlidingWindow approach with batch model)
  • No system audio capture yet (mic only)

What works

  • scribe stream starts and captures microphone audio
  • scribe stream --engine nemotron downloads and loads the Nemotron model
  • Partial text preview on stderr
  • Both text and JSONL output formats
  • Model download retry on partial/corrupt cache

Architecture decisions

  • Nemotron 560ms via StreamingAsrEngine protocol (true cache-aware streaming)
  • Parakeet TDT v3 via SlidingWindowAsrManager (batch model in sliding windows)
  • Actor-based state for thread safety (Swift 6 sendability)
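The actor-based state decision can be sketched as follows. This is a minimal illustration, assuming mutable stream state is touched from both the audio callback and the polling loop; `TranscriptBuffer` and its members are illustrative names, not the PR's actual types.

```swift
import Foundation

// Shared mutable state behind an actor: the audio callback and the poll
// loop both `await` into it, so Swift 6 sendability checking is satisfied
// without locks.
actor TranscriptBuffer {
    private var lines: [String] = []

    func append(_ line: String) {
        lines.append(line)
    }

    func snapshot() -> [String] {
        lines
    }
}
```

Actor isolation serializes all access, which is why no explicit synchronization appears anywhere else in the sketch.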

Test plan

  • Nemotron: speak English continuously, verify clean incremental output
  • Default: speak Spanish, verify transcription appears
  • Default: speak English then Spanish, verify no hang
  • --format jsonl produces valid JSON per line
  • --output file.txt saves to file
  • Ctrl+C exits cleanly
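For the JSONL checks, one line per event might look like the sketch below. The field set (`text`, `confirmed`, `timestamp`) is inferred from the PR description, not the actual schema.

```swift
import Foundation

// Hypothetical shape of one JSONL event; encode with JSONEncoder and
// print one object per line.
struct StreamEvent: Codable {
    let text: String
    let confirmed: Bool
    let timestamp: Double
}

let event = StreamEvent(text: "hello world", confirmed: true, timestamp: 1.25)
let data = try JSONEncoder().encode(event)
print(String(data: data, encoding: .utf8)!)
```

Each line being independently decodable is what makes the `--format jsonl` output safe to pipe into line-oriented tools.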

🤖 Generated with Claude Code

javiertoledo and others added 7 commits March 30, 2026 20:29
New command: scribe stream — captures microphone audio and transcribes
in real-time using FluidAudio's SlidingWindowAsrManager (Parakeet).

Features:
- Live transcription from microphone with timestamps
- Text and JSONL output formats
- Save to file with --output
- Ctrl+C to stop cleanly
- Uses streaming ASR config (11s chunks, 1s hypothesis updates)

Usage:
  scribe stream                      # listen and transcribe
  scribe stream --format jsonl       # JSONL output
  scribe stream --output meeting.txt # save to file

System audio capture (--source) will be added in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce chunk size from 11s to 3s for ~3-4s latency (was ~13s)
- Lower confirmation threshold from 0.8 to 0.5 for faster output
- Reduce right context from 2s to 0.5s
- Fix speaker label: remove "Others" tag for mic input
- Add text dedup to avoid repeating same hypothesis
- Remove --mic flag (mic is default and only source for now)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pdates

The 3s chunk config was too short for Parakeet — model needs ~10s context.
Reverted to the library's .streaming preset (11s chunks, 1s hypothesis).

Now shows two types of updates:
- Volatile (hypothesis): shown as ephemeral line on stderr with \r overwrite
  Gives immediate ~1-2s feedback while speaking
- Confirmed: printed as permanent line to stdout
  Stable, final text after sufficient context

Also fixes:
- Stream getting stuck on longer utterances (was breaking model state)
- Text format shows live preview on stderr, final on stdout
- JSONL emits both volatile and confirmed (with "confirmed" field)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
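The volatile/confirmed split described above can be sketched like this: hypothesis text is overwritten in place on stderr, while confirmed text becomes a permanent stdout line. Function names here are illustrative, not the PR's actual API.

```swift
import Foundation

// Ephemeral hypothesis: carriage return moves to column 0 so the next
// hypothesis overwrites this one; no trailing newline keeps it volatile.
func showVolatile(_ hypothesis: String) {
    FileHandle.standardError.write(Data(("\r" + hypothesis).utf8))
}

// Stable text: clear the ephemeral stderr line (ANSI erase-to-end-of-line),
// then print a permanent line to stdout.
func emitConfirmed(_ text: String) {
    FileHandle.standardError.write(Data("\r\u{1B}[K".utf8))
    print(text)
}
```

Keeping the preview on stderr means `scribe stream > out.txt` captures only confirmed text.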
Replace SlidingWindowAsrManager (batch TDT in sliding windows, ~11s latency)
with StreamingAsrEngine protocol using Nemotron 560ms:

- True cache-aware streaming: each 560ms chunk inherits full context
- 2.12% WER (better than TDT v3's 2.5% on LibriSpeech)
- Includes punctuation and capitalization
- ~560ms to first text (was ~11s)
- Partial transcript callback for live preview on stderr
- Confirmed text printed to stdout

Architecture:
- Mic audio → appendAudio() → processBufferedAudio() → getPartialTranscript()
- Partial callback fires on every chunk for live preview (\r overwrite on stderr)
- Main loop polls at 20Hz, emits new confirmed text to stdout
- Actor-based state management for thread safety (Swift 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
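The mic → append → poll flow above can be sketched roughly as follows. Everything here is illustrative: only the names `appendAudio` and `getPartialTranscript` come from the PR text, and the real FluidAudio signatures may differ.

```swift
import Foundation

// Stand-in for the streaming engine: buffers audio, exposes the
// accumulated transcript so far.
actor MockStreamingEngine {
    private var transcript = ""

    func appendAudio(_ samples: [Float]) {
        // Real engine: buffer samples and decode per 560 ms chunk.
    }

    func getPartialTranscript() -> String {
        transcript
    }
}

// Poll at ~20 Hz and emit only transcript text not printed before.
func pollLoop(engine: MockStreamingEngine) async {
    var printed = 0
    while !Task.isCancelled {
        let full = await engine.getPartialTranscript()
        if full.count > printed {
            print(String(full.dropFirst(printed)), terminator: "")
            printed = full.count
        }
        try? await Task.sleep(nanoseconds: 50_000_000) // 50 ms ≈ 20 Hz
    }
}
```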
- Default: Parakeet TDT v3 via SlidingWindow (25 languages, higher latency)
- --engine nemotron: Nemotron 560ms (English-only, ~560ms latency, punctuation)

Usage:
  scribe stream                    # multilingual (default)
  scribe stream --engine nemotron  # English-only, low latency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nemotron: retry with cache cleanup on failed model load (fixes partial download)
- Both engines: show download progress messages (not just --verbose)
- README: add streaming section with engine comparison and trade-offs
- README: update performance table with streaming latencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
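The "retry with cache cleanup" behavior might look roughly like this sketch; `load` and `cleanup` are placeholders for the real model loader and cache wipe, which live in the actual CLI code.

```swift
import Foundation

// Generic retry-once helper: if the first load throws (assumed here to
// mean a partial or corrupt cached download), wipe the cache and retry
// once with a clean download.
func loadWithRetry<Model>(
    load: () async throws -> Model,
    cleanup: () throws -> Void
) async throws -> Model {
    do {
        return try await load()
    } catch {
        try? cleanup()
        return try await load()
    }
}
```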
The Nemotron engine's partial callback returns the full accumulated
transcript each time, which grows and revises. The previous code tried
to diff via getPartialTranscript() polling, causing repeated/mixed output.

Fix: Track printed length in StreamState actor. The partial callback
fires after each 560ms chunk — we diff to find only the new portion
and emit that. Live preview shows the tail of the transcript on stderr
(ephemeral, overwritten). New confirmed text goes to stdout.

Also simplified SlidingWindow engine to only emit to stdout on confirmed
text (volatile goes to stderr preview only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
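The printed-length tracking described in this fix can be sketched as an actor that remembers how much of the accumulated transcript has already been emitted and returns only the new suffix. `StreamState` here follows the name in the commit message, but the body is illustrative.

```swift
import Foundation

actor StreamState {
    private var printedLength = 0

    // Called from the partial callback with the full accumulated
    // transcript; returns only the portion not yet emitted.
    func newPortion(of fullTranscript: String) -> String {
        guard fullTranscript.count > printedLength else { return "" }
        let suffix = String(fullTranscript.dropFirst(printedLength))
        printedLength = fullTranscript.count
        return suffix
    }
}
```

Note this assumes the transcript only grows; if the engine revises an already-emitted prefix, a pure length diff can still emit stale text, which matches the known-issue list above.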